Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Zarr sink Improvements #1713

Open
wants to merge 4 commits into
base: master
Choose a base branch
from
Open

Zarr sink Improvements #1713

wants to merge 4 commits into from

Conversation

annehaley
Copy link
Collaborator

@annehaley annehaley commented Nov 1, 2024

Resolves #1674. Resolves #1698.

@annehaley annehaley marked this pull request as ready for review November 5, 2024 18:29
@manthey
Copy link
Member

manthey commented Nov 7, 2024

I have a multiprocess code path that behaves poorly.

In the main process, create the sink and add a tile.
Create a subprocess. It doesn't know that they are already tiles present, so the newaxes list is non-empty. Something uses a lot of memory as a tile is added.

I think the solution is that if new_axes (https://github.com/girder/large_image/pull/1713/files#diff-c79d7356628b5b310aaeb154520b413576727848e175b84f16af2b0eaadb98deR735) is not empty, we set the updateMetadata flag.

My test case was to use the examples/algorithm_progression.py example using --multiprocessing and to hack in an sink.addTile with a 1 pixel record at the maximum sweep locations. That is,

diff --git a/examples/algorithm_progression.py b/examples/algorithm_progression.py
index 52271cfb..2a70c0dc 100755
--- a/examples/algorithm_progression.py
+++ b/examples/algorithm_progression.py
@@ -42,7 +42,7 @@ class SweepAlgorithm:
 
         self.combos = list(itertools.product(*[p['range'] for p in input_params.values()]))
 
-    def getOverallSink(self):
+    def getOverallSink(self, maxValues=None):
         msg = 'Not implemented'
         raise Exception(msg)
 
@@ -118,7 +118,10 @@ class SweepAlgorithm:
 
     def run(self):
         starttime = time.time()
-        sink = self.getOverallSink()
+        source = large_image.open(self.input_filename)
+        maxValues = {'x': source.sizeX, 'y': source.sizeY, 's': source.metadata['bandCount']}
+        maxValues.update({p['axis']: len(p['range']) for p in self.param_order.values()})
+        sink = self.getOverallSink(maxValues)
 
         print(f'Beginning {len(self.combos)} runs on {self.max_workers} workers...')
         num_done = 0
@@ -149,7 +152,7 @@ class SweepAlgorithm:
 
 
 class SweepAlgorithmMulti(SweepAlgorithm):
-    def getOverallSink(self):
+    def getOverallSink(self, maxValues=None):
         os.makedirs(os.path.splitext(self.output_filename)[0], exist_ok=True)
         algorithm_name = self.algorithm.__name__.replace('_', ' ').title()
         self.yaml_dict = {
@@ -270,10 +273,14 @@ class SweepAlgorithmMultiZarr(SweepAlgorithmMulti):
 
 
 class SweepAlgorithmZarr(SweepAlgorithm):
-    def getOverallSink(self):
+    def getOverallSink(self, maxValues=None):
         import large_image_source_zarr
 
-        return large_image_source_zarr.new()
+        sink = large_image_source_zarr.new()
+        if maxValues:
+            sink.addTile(np.zeros((1, 1, maxValues.get('s', 1))),
+                         **{k: v - 1 for k, v in maxValues.items() if k != 's'})
+        return sink
 
     def writeOverallSink(self, sink):
         sink.write(self.output_filename, lossy=self.lossy)

and then python examples/algorithm_progression.py ppc --param=hue_value,hue,0,1,100,open --param=hue_width,width,0.10,0.25,4 -w 36 --multiprocessing build/tox/externaldata/sample_Easy1.png /tmp/sweep.zarr.zip

@manthey
Copy link
Member

manthey commented Nov 7, 2024

Hmm... If we did that, then maybe we'd also need to call self._validateZarr() in the _initNew method if we didn't create the file, but then things fail if we aren't setting data before doing the multiprocessing fork.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Set frame values in multiple processes Modify Zarr sink addTile to allow creation of new axes
2 participants